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Abstract 

Background: Reticulate events play an innportant role in deternnining evolutionary relationships. The problenn of 
connputing the minimum number of such events to explain discordance between two phylogenetic trees is a hard 
computational problem. Even for binary trees, exact solvers struggle to solve instances with reticulation number larger 
than 40-50. 

Results: Here we present CycleKiller and NonbinaryCycleKiller, the first methods to produce solutions verifiably 
close to optimality for instances with hundreds or even thousands of reticulations. 

Conclusions: Using simulations, we demonstrate that these algorithms run quickly for large and difficult instances, 
producing solutions that are very close to optimality. As a spin-off from our simulations we also present TerminusEst, 
which is the fastest exact method currently available that can handle nonbinary trees: this is used to measure the 
accuracy of the NonbinaryCycleKiller algorithm. All three methods are based on extensions of previous theoretical 
work (SIDMA 26(4):1 635-1 656, TCBB 1 0(1 ):1 8-25, SIDMA 28(1 ):49-66) and are publicly available. We also apply our 
methods to real data. 
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Background 

Phylogenetic trees are used in biology to represent the 
evolutionary history of a set A* of species (or taxa) [1,2]. 
They are trees whose leaves are bijectively labeled by A' 
and whose internal vertices represent the ancestors of 
the species set; they can be rooted or unrooted. Since 
in a rooted tree edges have a direction, the concepts 
of indegree and outdegree of a vertex are well defined. 
Binary rooted (phylogenetic) trees are rooted (phyloge- 
netic) trees whose internal vertices have outdegree 2. 
Nonbinary rooted (phylogenetic) trees have no restriction 
on the outdegree of inner vertices. 

Biological events in which a species derives its genes 
from different ancestors, such as hybridization, recom- 
bination and horizontal gene transfer events, cannot be 
modelled by a tree. To be able to represent such events. 
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a generalization of trees is considered which allows ver- 
tices with indegree two or higher, known as reticulations. 
This model, which is called a rooted phylogenetic network, 
is of growing importance to biologists [3]. For detailed 
background information we refer the reader to [4-6]. 

Although phylogenetic networks are more general than 
phylogenetic trees, trees are still often the basic build- 
ing blocks from which phylogenetic networks are con- 
structed. Specifically, there are many techniques available 
for constructing gene trees. However, when more genes 
are analyzed, topological conflicts between individual 
gene phylogenies can arise for methodological or biolog- 
ical reasons (e.g. aforementioned reticulate phenomena 
such as hybridization). This has led computational biolo- 
gists to try and quantify the amount of reticulation that is 
needed to simultaneously explain two trees. 

To state this problem more formally, we have that a phy- 
logenetic tree T on A' is a refinement of a phylogenetic 
tree T' on the same set X ii T can be obtained from T' 
by deleting edges and identifying their incident vertices. 
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Then, we say that a phylogenetic network AT on A' displays 
a phylogenetic tree T on X ii T can be obtained from 
a subgraph of N by contracting edges. Informally, this 
means that (a refinement of) T can be obtained from N 
by, for each reticulation vertex of Ny "switching off" all 
but one of its incoming edges and then suppressing all 
indegree-1 outdegree-1 vertices (i.e. replacing paths of 
these vertices by one edge). Given two rooted phyloge- 
netic trees Ti and T2 on Af, the problem then becomes 
to determine the minimum number of reticulation events 
contained in a phylogenetic network N on X display- 
ing both trees (where an indegree-(i reticulation counts 
as <i — 1 reticulation events). The value we are minimiz- 
ing is often called the hybridization number and instead 
of the term phylogenetic network, the term hybridization 
network is often used. It is known that the problem of 
computing hybridization numbers is both NP-hard and 
APX-hard [7], but it is not known whether it is in APX 
(i.e. whether it admits a polynomial-time approximation 
algorithm that achieves a constant approximation ratio). 

Until recently, most research on the hybridization num- 
ber of two phylogenetic trees had focused on the question 
of how to exactly compute this value using fixed param- 
eter tractable (FPT) algorithms, where the parameter in 
question is the hybridization number r of the two trees. 
For an introduction to FPT we refer to [8,9]. 

For binary trees, algorithmic progress has been consid- 
erable in this area, with various authors reporting increas- 
ingly sophisticated FPT algorithms [10-13]. The fastest 
algorithms currently implemented are the algorithm 
available inside the package Dendroscope [14], based 
on [15], and the sequence of progressively faster algo- 
rithms in the HybridNet family [11,16,17]. The fastest 
theoretical FPT algorithm has running time 0{?>.lWn) 
[13], where n is the number of taxa in the trees. 

Even though in practice it rarely happens that trees are 
binary, the nonbinary variant of the problem has been less 
studied. The nonbinary version is also FPT [18,19] and a 
(non-FPT) algorithm has recently been implemented in 
Dendroscope [14]. 

Such (FPT) algorithms do, however, have their limits. 
The running time still grows exponentially in r, albeit usu- 
ally at a slower rate than algorithms that have a running 
time of the form r^^^\ where/ is some function of r. In 
practice this means that existing algorithms can only han- 
dle instances of binary trees when r is at most 40-50 and 
instances of nonbinary trees when r is at most 5-10. 

These limitations are problematic. Due to ongoing 
advances in DNA sequencing, more and more species 
and strains are being sequenced. Consequently, biologists 
use trees with more and more taxa and software that 
can handle large trees is required. For such large and/or 
difficult trees one can try to generate heuristic or approx- 
imate solutions, but how far are such solutions from 



optimality? In [20] we showed that the news is worrying. 
Indeed, we showed that polynomial-time constant-ratio 
approximation algorithms exist if and only if such algo- 
rithms exist for the problem Directed Feedback Vertex 
Set (DFVS). However, DFVS is a well-studied problem in 
combinatorial optimization and to this day it is unknown 
if it permits such an algorithm. Pending a major break- 
through in computer science, it therefore seems difficult 
to build polynomial-time algorithms which approximate 
hybridization number well. On the positive side, we 
showed that in polynomial time an algorithm with approx- 
imation ratio O(logrloglogr) is possible. However, this 
algorithm is purely of theoretical interest and is not useful 
in practice. 

New algorithms: CycleKiller and NonbinaryCycleKiller 

In this article we extend the theoretical work of [20] 
slightly and give it a practical twist to yield a fast approx- 
imation algorithm which we have made publicly available 
as the program CycleKiller. Furthermore, we give an 
implementation of the algorithm presented in [21], avail- 
able as NonbinaryCycleKiller. 

The worst-case running time of these approximation 
algorithms is exponential. However, as we demonstrate 
with experiments, the running time of our algorithms is 
in practice extremely fast. For large and/or massively dis- 
cordant binary trees, CycleKiller is typically orders of 
magnitude faster than the HybridNet algorithms and 
the algorithm in DENDROSCOPE. The performance gap 
between NonbinaryCycleKiller and its exact coun- 
terparts is less pronounced, but still significant, especially 
in its fastest mode of operation. 

Of course, exact algorithms attempt to compute opti- 
mum solutions, whereas our algorithms only give approx- 
imate solutions. Nevertheless, our experiments show that 
when CycleKiller and NonbinaryCycleKiller are 
run in their most accurate mode of operation, an approx- 
imation ratio very close to 1 is not unusual, suggest- 
ing that the algorithms often produce solutions close to 
optimality and well within the worst-case approximation 
guarantee. 

The idea behind the binary and nonbinary algorithm 
is similar. Specifically, we describe an algorithm with 
approximation ratio d{c +1) for the hybridization num- 
ber problem on two binary trees and an algorithm with 
approximation ratio d{c + 3) for the hybridization num- 
ber problem on two nonbinary trees by combining a 
c- approximation for the problem MAF (Maximum Agree- 
ment Forest) with a /i-approximation for the problem 
DFVS. Both these problems are NP-hard so polynomial- 
time algorithms attaining c = 1 ov d = 1 are not 
realistic. Nevertheless, there exist extremely fast FPT 
algorithms for solving MAF on binary trees exactly (i.e. 
c = 1), the fastest is rSPR by Whidden, Beiko and Zeh 
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[22,23] although the MAF algorithm inside [17] is also 
competitive. Moreover, we observe that the type of DF VS 
instances that arise in practice can easily be solved using 
Integer Linear Programming (ILP) (and freely-available 
ILP solver technology such as GLPK), so d = 1 is also 
often possible. 

Combining these two exact approaches gives us, in 
the binary case, an exponential- time approximation algo- 
rithm with worst-case approximation ratio 2 that for 
large instances still runs extremely quickly; this is the 
2 -approx option of CycleKiller. In practice, we have 
observed that the upper bound of 2 is often pessimistic, 
with much better approximation ratios observed in exper- 
iments (1.003 on average for the simulations presented in 
this article). We find that this algorithm already allows 
us to cope with much bigger trees than the HybridNet 
algorithms or the algorithm in Dendroscope. 

Nevertheless, for truly massive trees it is often not feasi- 
ble to have c = 1. Fortunately there exist linear- time algo- 
rithms which achieve c = 3 [13]. This, coupled with the 
fact that (even for such trees) it remains feasible to use an 
exact {d = 1) solver for DFVS, means that in practice we 
achieve a 4-approximation for gigantic binary trees; this is 
the 4 -approx option of CycleKiller. Again, the ratio 
of 4 is a worst-case bound and we suspect that in practice 
we are doing much better than 4. However, this cannot 
be experimentally verified due to the lack of good lower 
bounds for such massive instances. In any case, the main 
advantage of this option is that it can, without too much 
effort, cope with trees with hundreds or thousands of taxa 
and hybridization number of a similar order of magnitude. 
An implementation of CycleKiller and accompanying 
documentation can be downloaded from http://skelk.sdf- 
eu.org/cyclekiller. Networks created by the algorithm can 
be viewed in DENDROSCOPE. 

For the nonbinary case, there also exist exact and 
approximation algorithms for MAF [13,21,24]. In case 
when one of the input trees is binary we can still use the 
exact (thus c = 1) and approximate (c = 3) algorithms 
given in [13] (referred to as rSPR) to obtain respectively a 
4-approximation and a 6-approximation of the hybridiza- 
tion number problem for nonbinary trees. When both 
input trees are nonbinary, then we must use the somewhat 
less optimized exact (c = 1) and approximate (c = 4) 
algorithms described in [21]. We then obtain 4- and 7- 
approximations (because in the nonbinary case d = 1 is 
still easily attainable using ILP). 

To measure the approximation ratios attained by NON- 
BINARyCycleKiller in practice we have also imple- 
mented and made publicly available the exact nonbinary 
algorithm TerminusEst, based on the theoretical results 
in [18]. TerminusEst will be of independent interest 
because it is currently the fastest exact nonbinary solver 
available. 



CycleKiller, NonbinaryCycleKiller and Termi- 
nusEst can be downloaded respectively from http:// 
skelk.sdf-eu.org/cyclekiller [25], from http://homepages. 
cwi.nl/~iersel/cyclekiller [26], and from http://skelk.sdf- 
eu.org/terminusest [27]. 

Theoretical and practical significance 

We have described, implemented and made publicly avail- 
able two algorithms with two desirable qualities: they ter- 
minate quickly even for massive instances of hybridization 
number and give a non-trivial guarantee of proximity to 
optimality. These are the first algorithms with such prop- 
erties. Both algorithms are based on a non-trivial marriage 
of MAF and DFVS solvers (both exact and approximate), 
meaning that further advances in solving MAF and DFVS 
will directly lead to improvements in CycleKiller and 
NonbinaryCycleKiller. 

This article also improves the theoretical work given in 
[20], which also proposed using DFVS but beginning from 
a trivial Agreement Forest (AF) known as a chain forest 
Here we use a smarter starting point: an (approximate) 
MAF, and it is this insight which makes a 2-approximation 
(rather than the 6-approximation impUed by [20]) possi- 
ble when using an exact DFVS solver. Other articles have 
also had the idea of cycle-breaking in AFs: the advanced 
FPT algorithm of Whidden et al. [13] - which has not 
been implemented - and the algorithms in the aforemen- 
tioned HybridNet family. However, both algorithms start 
the cycle-breaking from many starting points. In contrast, 
our algorithm requires only a single starting point, i.e. a 
single (approximate) solution to MAF. 

Here, we only present the theory behind the binary algo- 
rithm. The nonbinary case is more involved and we refer 
the reader to [21] in which we introduce it. Note that 
our results for the binary case do not follow from the 
results for the nonbinary case in [21] because here we 
obtain a better constant in the approximation ratio. After 
a presentation of the binary algorithm in Section "The 
algorithm for binary trees", we will show the results of 
some experiments with binary trees in Section "Practical 
experiments with binary trees" and nonbinary trees in 
Section "Practical experiments with nonbinary trees". 
Finally, in Section "Practical experiments on biologically 
relevant trees" we demonstrate that both TerminusEst 
and NonbinaryCycleKiller are easily capable of gen- 
erating optimal (respectively, nearly optimal) solutions 
on a real biological dataset originally obtained from the 
GreenPhylDB database. 

Technical note 

At the time the experiments on binary trees were con- 
ducted (i.e. for the preliminary version of this article 
[28]) HybridNet was the fastest algorithm available in 
its family. It has recently been superceded by the faster 
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UltraNet [17]. We believe, however, that it is nei- 
ther necessary nor desirable to re-run the binary exper- 
iments, for the following reasons. In the same period 
the solver rSPR has also increased dramatically in speed 
(it is now at vl.2), leading to a corresponding speed-up 
in CycleKiller. In fact, both rSPR and the algorithms 
in the HybridNet family are constantly in flux and are 
always being improved, so any experimental setup is prone 
to age extremely quickly. However, the conclusions that 
we can derive from these experiments are unlikely to 
change much over time. Given that the algorithms in the 
HybridNet family (and the theoretical algorithm in [13]) 
implicitly have to explore exponentially many optimal and 
sub-optimal solutions to the MAP problem, the running 
time of MAP solvers (and thus also CycleKiller) is 
likely for the foreseeable future to remain much better 
than the running time of solvers for hybridization number. 
The central message is stable: approximating hybridiza- 
tion number by splitting it into MAP and DPVS instances 
yields extremely competitive approximation ratios for 
instances that exact hybridization number solvers will 
probably never be able to cope with. 

Methods 

Preliminaries 

Let A* be a finite set (e.g. of species). A rooted phylogenetic 
X'tree is a rooted tree with no vertices with indegree 1 
and outdegree 1, a root with indegree 0 and outdegree at 
least 2, and leaves bijectively labelled by the elements of X, 
We identify each leaf with its label and use L{T^ to refer 
to the leaf set (or label set) of T. A rooted phylogenetic 
A'-tree is called binary if each nonleaf vertex has outde- 
gree two. We henceforth call a rooted, binary phylogenetic 
A'-tree a tree for short. Por a tree T and a set X' c X, 
we use the notation T{X') to denote the minimal subtree 
of T that contains all elements of X' and T\X' denotes the 
result of suppressing all indegree- 1 outdegree- 1 vertices 
in T{X'), 

The following definitions apply only to binary trees. 
Definitions for nonbinary trees are analogous but slightly 
more technical [21]. 

We define 2i for est as a set of trees. Each element of a for- 
est is called a component. Let T be a tree and T a forest. 
We say that is a forest for T if, for all F e T\L(F) is 
isomorphic to F and the trees { T(L{F)), F e are vertex- 
disjoint subtrees of T whose leaf-set union equals L{T), 
If Ti and J2 are two trees, then a forest T is an agree- 
ment forest of Ti and T2 if it is a forest for Ti and T2. The 
number of components of T is denoted 

We define cleaning up a directed graph as repeatedly 
suppressing indegree- 1 outdegree- 1 vertices, removing 
indegree-0 outdegree- 1 vertices and removing unlabelled 
outdegree-0 vertices until no such operation is possible. 
Observe that, if is a forest for T, T can be obtained 



from T by removing | — 1 edges and cleaning up. Prom 
now on we consider Ti, T2 as trees on the same taxon set. 

Problem: Maximum Agreement Porest (MAP) 
Instance: Two rooted, binary phylogenetic trees Ti 
and 

Solution: An agreement forest T of Ti and T2. 
Objective: Minimize \T\ — 1. 

The directed graph IG(Ti,T2,J-'), called the inheritance 
graph, is the directed graph whose vertices are the com- 
ponents of T and which has an edge (F, F') precisely if 
either 

• there is a directed path in Ti from the root of 
Ti(L(F)) to the root of Ti(L(F')) or; 

• there is a directed path in T2 from the root of 
T2(L(F)) to the root of T2{L{F')), 

An agreement forest F of T\ and T2 is called an acyclic 
agreement forest if the graph IG(Ti, T2, J^) is acyclic. A 
maximum acyclic agreement forest (MAAF) of Ti and T2 
is an acyclic agreement forest of Ti and T2 with a mini- 
mum number of components. 

Problem: Maximum Acyclic Agreement Porest (MAAP) 
Instance: Two rooted, binary phylogenetic trees Ti 
and 

Solution: An acyclic agreement forest T of Ti and T2. 
Objective: Minimize \T\ — 1. 

We use MAP(ri, T2) and MAAP(ri, T2) to denote the 
optimal solution value of the problem MAP and MAAP 
respectively, for an instance Ti, T2^ 

A rooted phylogenetic network on A' is a directed acyclic 
graph with no vertices with indegree 1 and outdegree 1 
and leaves bijectively labelled by the elements of X, 
Rooted phylogenetic networks, which are sometimes also 
called hybridization networks, will henceforth be called 
networks for short in this paper. A tree T on A' is displayed 
by a network A/^ if T can be obtained from a subtree of A/^ by 
contracting edges. A reticulation is a vertex v with 5 ~ (v) > 
2 (with (5~ (v) denoting the indegree of v). The reticulation 
number (sometimes also called hybridization number) of 
a network N with root p is given by 

r(A^) = ^(5-(v)-l). 

It was shown that the optimum to MAAP is equal to the 
optimum of the following problem [29] . 

Problem: MinimumHybridization 

Instance: Two rooted binary phylogenetic trees Ti 

and T2. 
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Solution: A rooted phylogenetic network N that displays 
Ti and 72. 

Objective: Minimize r(N). 

Moreover, it was shown that, for two trees Ti, T2, any 
acyclic agreement forest for Ti and T2 with k -\- 1 com- 
ponents can be turned into a phylogenetic network that 
displays Ti and T2 and has reticulation number /c, and 
vice versa. Thus, any approximation for MAAF gives an 
approximation for MinimumHybridization. 

Finally, a feedback vertex set of a directed graph is a sub- 
set of the vertices that contains at least one vertex of each 
directed cycle. Equivalently, a subset of the vertices of a 
directed graph is a feedback vertex set if removing these 
vertices from the graph makes it acyclic. 

Problem: Directed Feedback Vertex Set (DF VS) 
Instance: A directed graph D. 

Goal: Find a feedback vertex set of D of minimum size. 

We note that the definition of MINIMUMHYBRIDIZA- 
TION easily generalises to nonbinary trees, since the def- 
inition of display allows the image of each input tree in 
the network to be more "resolved" than the original tree. 
However, the definitions of (acyclic) agreement forests are 
different in the nonbinary case [21]. 

The algorithm for binary trees 

We show how MAAF can be approximated by combin- 
ing algorithms for MAF and DFVS. In particular, we will 
prove the following theorem. 

Theorem 1. If there exists a c- approximation for MAF 
and a d-approx-imation for DFVS, then there exists a 
d(c + 1)' approximation for MAAF (and thus for MlNl- 
mumHybridizationJ. 

Note that this theorem does not follow from Theorem 
2.1 of [21], since there the approximation ratio for MAAF 
is a d{c + 3) -approximation. 

To prove the theorem, suppose there exists a c- 
approximation for MAF. Let Ti and T2 be two trees and 
let M be an agreement forest returned by the algorithm. 
Then, 

|M| - 1 < c . MAF(ri, T2)<C' MAAF(ri, T2). (!) 

An M'Splitting is an acyclic agreement forest that can be 
obtained from M by removing edges and cleaning up. 

Lemma 2. Let T\ and T2 be two trees and Man agreement 
forest of Ti and T2. Then, there exists an M -splitting of size 
at most MAAF(Tu T2) + \M\, 

Proof Consider a maximum acyclic agreement forest F 
of Ti and T2. For / G {1, 2}, F can be obtained from Ti by 



removing a set of edges, say E^, and cleaning up. More- 
over, also M can be obtained from Ti by removing a set of 
edges, say and cleaning up. 

Now consider the forest S obtained from Ti by remov- 
ing E]^ U £p and cleaning up. Then, 

• S is an agreement forest of Ti and T2 because it can 
be obtained from T2 by removing edges E^ U E^ and 
cleaning up; 

• 5 is acyclic because it can be obtained by removing 
edges from F, which is acyclic, and cleaning up; 

• S can be obtained from M by removing edges and 
cleaning up. 

Hence, S is an M-splitting. Furthermore, l^l < + 
\E]^\ + 1. The lemma follows since = MAAF(ri, T2) 
and|Af| = |£i,| + L □ 

Let OptSpUttingr^,r2 (^) denote the size of a minimum- 
size M-splitting. Combining Lemma 2 and Eq. 1, we 
obtain 

OptSplittingj^^ - 1 < (c+ l)MAAF(ri, T2) (2) 

We will now show how to find an approximation for 
the problem of finding an optimal M-splitting. We do so 
by reducing the problem to DFVS. We construct an input 
graph D for DFVS (called the extended inheritance graph) 
as follows. For every vertex of M that has outdegree 2 
(in M), we create a vertex in D, There is an edge in D from 
a vertex w to a vertex v precisely if in either Ti or T2 (or in 
both) there is a directed path from u to v. An example is 
in Figure 1. We claim the following. 

Lemma 3. A subset V of the vertices of D is a feedback 
vertex set ofD if and only if removing V from M makes it 
an acyclic agreement forest 

Proof We show that D\V^ has a directed cycle if and 
only if the inheritance graph oiM\V' has a directed cycle. 

To prove this, first suppose that there is a cycle 
vi, V2, . . . , Vk = vi in the inheritance graph of M \ V , The 
vertices in the inheritance graph of M \ V' correspond to 
the roots of the components of M \ V\ Since these roots 
have outdegree 2 in M \ V , they had outdegree 2 in M, 
and are thus vertices of D. So the vertices vi, V2, . . . , v^ that 
form the cycle are vertices of D, Since these vertices are 
in the inheritance graph of M \ V , they can not be in V 
and so they are vertices of D \ V\ The reachability rela- 
tion between these vertices 'mD\V' is the same as in the 
inheritance graph of M \ V . So, the vertices vi, V2, . . . , v^: 
form a cycle in D \ V' . 

Now suppose that there is a cycle wi, vi/2, . . .,Wk = wi 
in D\ V\ Each of the vertices wi, W2f . . . , m/^ is a vertex 
with outdegree-2 in M. Some of them might be roots of 
components, while others are not. However, observe that 
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Figure 1 Two binary trees Ti and T2 and the auxiliary grapii D. A maximum agreement forest M of T] and T2 is obtained by deleting tlie daslied 
edges. Grapli D can be made acyclic by deleting either both filled or both unfilled vertices. Hence, removing either V] and vj or 1/3 and 1/4 from M 
makes it an acyclic agreement forest for T] and Tj, see Lemma 3. The acyclic agreement forest M\{V], vj} obtained by removing V] and vj from M is 
depicted on the right. 



if there is a directed path from a vertex uto^ vertex v in Ti 
(or in T2) then there is also a directed path from the root 
of the component of M \ that contains u to the root of 
the component ofM\V' that contains v. Hence, there is a 
directed cycle in the inheritance graph of M \ V\ formed 
by the roots of the components of M \ that contain 

Proof of Theorem 1. Suppose that there exists a d- 
approximation for DFVS. Let FYS be a feedback vertex 
set returned by this algorithm and let MFVS be a min- 
imum feedback vertex set. Then, removing the vertices 
of MFVS from M gives an optimal M-splitting. Further- 
more, OptSplittingri,r2W = 1^1 + |MFVS|. This is 
because for every vertex in a cycle C, its parent in M must 
participate in some cycle that contains elements of C. So 
if we start by removing the root of the component we 
are splitting and subsequently remove only those vertices 
whose parents have already been removed we see that we 
add at most one component per vertex. In fact, because 
vertices of D all have out-degree 2 in M, we add exactly 
one component per vertex. 

By removing the vertices of FVS from M, we obtain an 
acyclic agreement forest T such that 



= |M| + |FVS|-1 

< \M\+d' |MFVS| - 1 

< d{\M\ + |MFVS| - 1) 

= ^(OptSplittingj^^ j^^CM) - 1) 

< d{c + l)MAAF(ri, 



where the last inequality follows from Eq, 2, Thus, T is 
a d{c + 1) -approximation to MAAF, which concludes the 
proof of Theorem 1. □ 



Theorem 1 implies that a solution to the MAAF prob- 
lem for a given instance can be constructed by (i) finding a 
solution T to the MAF problem for the same instance (ii) 
constructing the extended inheritance graph D for (iii) 
finding a solution V for the DFVS problem on the graph 
D and (iv) modifying F accordantly to V. 

Results and discussion 

Practical experiments with binary trees 

To assess the performance of CycleKiller, a simulation 
study was undertaken. We generated 3 synthetic datasets, 
an easy, a medium and a hard one, containing respec- 
tively 800, 640 and 640 pairs of rooted binary phylogenetic 
trees. 

The easy data set was created by varying two param- 
eters, namely the number of taxa n and the number of 
rSPR-moves k used to obtain the second tree from the first 
(note that this number is an upper bound on the actual 
rSPR distance). The 800 pairs of rooted binary phyloge- 
netic trees were created by varying n in {20, 50, 100, 200} 
and k in {5, 10, 25}, and then creating 40 different 
instances per each combination of parameters. Each pair 
(Ti, T2) of rooted binary phylogenetic trees for a given set 
of parameters n and k is created as follows: The first tree 
Ti on X = {xi, . . . ,Xn} '^^ generated by first creating a set 
of n leaf vertices bijectively labeled by the set X, Then, 
two vertices u and v, both with indegree 0, are randomly 
picked and a new vertex along with two new edges 
{w, u) and {w, v), is created. This is done until only one 
vertex with no ancestor, the root, is present. The second 
tree T2 is obtained from Ti by applying k rSPR-moves. 
The medium and the hard data sets were generated in the 
same way as the easy one, but for different choices of the 
parameters: n in {50, 100, 200, 300} and k in {15, 25, 40, 55} 
for the medium one and n in {100, 200, 400, 500} and k in 
{40, 60, 80, 100} for the hard one. 
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The exact hybridization number has been computed by 
HybridNet [11], available from http://www.cs.cityu.edu. 
hk/~lwang/software/Hn/treeComp.html or with Den- 
DROSCOPE [14], available from http://dendroscope.org/. 
We will refer to these algorithms as the exact algorithms. 
Each instance has been run on a single core of an Intel 
Xeon E5506 processor. 

Each run that took more than one hour was aborted. 
For each instance, we ran our program with the 
option 2-approx, and, in case the latter did not 
finish within one hour, we ran it again, this time 
using the option 4-approx, always with a one-hour 
limit (see Section "New algorithms: CycleKiller and 
NonbinaryCycleKiller"). We used the program rSPR 
vl.03 [22,23] to solve or approximate MAE and GLPK 
v4.47 (http://www.gnu.org/software/glpk/) to solve the 
following simple polynomial-size ILP formulation of 
DEVS: 

min ^ Xy 

veV 

s.t. 

0<€v< l^l-l for all VG V 

lv>lu + l- \ V\Xu - \ V\Xv for all e=iu,v) eE 
lyeZ for 2illveV 

Xy e {0, 1} for all V G F 

Given a directed graph D = the binary vari- 

ables Xy model whether a vertex is in the feedback vertex 
set, and the integer variables ly model the positions of the 
surviving vertices in the induced topological order. The 
edge constraints enforce the topological order. Note that 
an edge constraint is essentially eliminated if one or both 
endpoints of the edge are in the feedback vertex set. 

Eor all instances of the easy data set, CycleKiller fin- 
ished with the 2-approx option within the one hour 
limit, while for 33 instances the exact algorithms were 
unable to compute the hybridization number. Note that, 
even for "easy" instances, computing the exact hybridiza- 
tion number can take a very long time. To give the 
reader an idea, for 9 runs of the easy data, Dendro- 
SCOPE and HybridNet did not complete within 10 days. 
Table 1 shows a summary of the results. It can be seen 
that CycleKiller was much faster than the exact algo- 
rithms. Moreover, for 96.6% of the instances for which an 
exact algorithm could find a solution, CycleKiller also 
found an optimal solution. While the theoretical worst- 
case approximation ratio of the 2-approx option of 
CycleKiller is 2, in our experiments it performed very 
close to a 1 -approximation. 

Eor the medium data set, CycleKiller finished with 
the 2-approx option for 613 instances, and for the 
remaining ones with the 4-approx option. The exact 
algorithms could compute the hybridization number for 



only 199 instances (out of 640). Eor 97.5% of these 
instances, CycleKiller also found an optimal solution, 
but with a much better running time. Regarding the hard 
data set, 444 runs were completed with the 2-approx 
option and for the remaining ones we were able to use 
the 4-approx option within the given time constraint. 
Unfortunately, the exact algorithms were unable to com- 
pute the hybridization number for any tree-pair of this 
data set and hence we could not compute the average 
approximation ratios. Over all our experiments, the max- 
imum hybridization number that the exact algorithms 
could handle was 25^. In contrast, the 2-approx option 
of CycleKiller could be used for instances for which the 
size of a MAE was up to 97, and thus for instances for 
which the hybridization number was at least 97. 

To find the limits of the 4-approx option of 
CycleKiller, we also tested it on randomly generated 
trees. On a normal laptop, it could construct networks 
with up to 10,000 leaves and up to 10,000 reticulations 
within 10 minutes. Since the number of reticulations 
found is at most four times the optimal hybridization 
number, this implies that the 4-approx option of 
CycleKiller can handle hybridization numbers up to at 
least 2,500. These randomly generated trees are, however, 
biologically meaningless and, therefore, we conducted the 
extensive experiment described above on trees generated 
by rSPR moves. Einally we note that over all experiments 
the worst approximation ratio we encountered was 1.2. 

Practical experiments with nonbinary trees 

To run the simulations with NonbinaryCycleKiller, 
we used a subset of the trees from the easy set of binary 
experiments. We then applied random edge contractions 
in order to obtain nonbinary trees. Hence, we have the 
same two parameters as before, namely the number of 
taxa n g {20, 50, 100} and the number of rSPR-moves 
k G {5, 10, 15, 20}, and an additional parameter p g 
{25, 50, 75} which measures the percentage of the edges 
of an original binary tree that were contracted in order to 
obtain a nonbinary tree. We could only use smaller values 
of n and k from the easy set of experiments because exact 
solvers for nonbinary MAE (upon which NonbinaryCy- 
CLEKiller is built) and exact solvers for nonbinary MlN- 
imumHybridization (which is important to measure 
the accuracy of NonbinaryCycleKiller in practice) are 
slower than their binary counterparts. 

We performed two runs of experiments'^. One run with 
instances consisting of one binary and one nonbinary tree, 
and one run with instances consisting of two nonbinary 
trees. 

Eor the experiments with one binary and one non- 
binary tree, we were still able to use the RSPR algo- 
rithm [13,22], which has a better running time and 
approximation ratio compared to the available algo- 
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Table 1 Experimental results for instances with two binary trees 



Exact algorithms CycleKiller 



Dataset 


Total runs 


Completed 


Time 


2 - approx 


4 - approx 


Ratio 


Opt. found 










Compl. 


Time 


Compl. 


Time 






Easy 


800 


767 


798 


800 


3 






1.003 


96.6% 


Medium 


640 


199 


2572 


613 


212 


27 


<1 


1.002 


97.5% 


Hard 


640 


0 


3600 


440 


1271 


200 


1.5 







The third column indicates for how many instances at least one exact algorithm finished within one hour. The fifth column indicates for how many instances the 
2 - approx option of CycleKiller finished within one hour. For the remaining instances, the 4 - approx option finished within one hour, as can be seen from the 
seventh column. The average running time for the 2 -approx and the 4 -approx in seconds are reported respectively in the sixth and eighth column. The average 
approximation ratio (ninth column) is taken over all instances for which at least one exact method finished. The last column indicates the percentage of those 
instances for which CycleKiller found an optimal solution. 



rithm for two nonbinary trees. When rSPR is used in 
exact mode, NonbinaryCycleKiller yields a theoret- 
ical worst-case approximation ratio of 4. When rSPR 
is used in its 3-approximation mode, NonbinaryCy- 
cleKiller yields a theoretical worst-case approximation 
ratio of 6 (see Section "New algorithms: CycleKiller 
and NonbinaryCycleKiller"). The results of this run 
are summarized in Table 2. 

For the experiments with two nonbinary trees, the rSPR 
software can no longer be used, and instead we used 
the exact and 4-approximate MAF algorithm described 
in [21]. This makes NonbinaryCycleKiller be- 
have as a 4- approximation and 7-approximation respec- 
tively (see Section "New algorithms: CycleKiller and 
NonbinaryCycleKiller"). Note that the exact algo- 
rithm [21] is considerably slower than RSPR, meaning that 
in practice NONBINARYCYCLEKILLER struggles with two 
nonbinary trees more than when at most one of the trees 
is nonbinary. The results for this run are summarized in 
Table 3. 

The exact hybridization number in both runs was com- 
puted by TerminusEst [27]. 

Each instance that took longer than 10 minutes to com- 
pute was aborted and the running time was set to 600 



seconds. The averages of the running- times are taken over 
all instances, with running-time taken to be 600 if the 
program timed out for that instance. (We used a shorter 
time-out than in the binary experiments because of the 
observation that, in the nonbinary case, exact algorithms 
running longer than 10 minutes almost always took longer 
than 60 minutes too.) 

Note that we did not compare the performance of 
NonbinaryCycleKiller to Dendroscope because 
TerminusEst has better running times than the exact 
nonbinary MinimumHybridization solver inside DEN- 
DROSCOPE (data not shown). 

To enable a clearer analysis we divided the trees into 
representative "simple" and "tricky" ones based on two 
parameters, n and k. Parameter values for the simple set 
were n e {20, 50}, k e {5, 10, 15} and for the tricky set 
n e {50, 100}, k = 20. In addition we varied the percent- 
age of contracted edges (in a single tree in the first run and 
in both trees in the second run). 

In Table 2 we show running times and solution qual- 
ity of our algorithm when one of the input trees is binary. 
For the simple set of instances (regardless of the per- 
centage of edge-contractions) we see that the more accu- 
rate version of our algorithm, the 4-approximation, had 



Table 2 Summary of results for Instances with one binary and one nonbinary tree 

TerminusEst NonbinaryCycleKiller 



4 -approx 6 -approx 



Contr. 


Dataset 


opt 


Time 


m 


Time 


Ratio 


m 


Time 


Ratio 


25% 


Simple 


7.504 


8.004 


7.567 


0.967 


1.007 


11.421 


0.996 


1.532 




Tricky 


17.000 


203.650 


17.288 


3.675 


1.003 


27.238 


3.638 


1.600 


50% 


Simple 


6.736 


9.896 


6.829 


0.942 


1.008 


10.900 


0.925 


1.639 




Tricky 


14.976 


374.263 


16.288 


3.388 


1.006 


26.413 


3.438 


1.640 


75% 


Simple 


5.139 


1 2.304 


5.263 


0.867 


1.011 


8.692 


0.963 


1.659 




Tricky 


10.500 


391.575 


1 3.475 


3.263 


1.006 


23.200 


3.275 


1.633 


Worst case 


20 


600 


22 


15 


1.75 


37 


13 


3 





We list the average hybridization number found {opt and r{N)), the average running time in seconds (Time) and where applicable the average approximation ratio 
(Ratio) for the three algorithms. 
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Table 3 Summary of results for Instances with two nonblnary trees 

TerminusEst NonbinaryCycleKiller 



4 - approx 7 - approx 



Contr. 


Dataset 


opt 


Time 


m 


Time 


Ratio 


m 


Time 


Ratio 


25% 


Simple 


7.168 


12.971 


7.240 


43.967 


1.032 


16.338 


2.463 


2.343 




Tricky 


16.148 


279.100 








35.638 


7.000 


2.193 


50% 


Simple 


5.933 


11.150 


5.900 


41.325 


1.030 


13.721 


2.004 


2.405 




Tricky 


13.216 


379.238 








32.363 


7.200 


2.331 


75% 


Simple 


3.654 


1.121 


3.729 


4.208 


1.015 


9.075 


1.483 


2.590 




Tricky 


8.672 


183.150 








21.950 


5.800 


2.294 




Worst case 


20 


600 


29 


600 


1.5 


56 


22 


4 



The layout of the table is the same as that of Table 2. 



a better running time than the exact algorithm, and at 
the same time had an average approximation ratio very 
close to 1. Far more interesting is to see what happens 
with tricky instances. As predicted, the running time of 
the exact algorithm is much higher for tricky instances 
due to the higher hybridization numbers. On the other 
hand, the running time of the 4-approximation does not 
rise significantly at all, whilst still attaining an approxi- 
mation ratio again very close to 1. Another thing to note 
is that the percentage of contraction only seems to affect 
the running time of the exact algorithm. The practical 
worst-case approximation ratio observed in these exper- 
iments was 1.75 for the 4-approximation and 3 for the 
6-approximation. 

Table 3 shows our results on instances with two non- 
binary trees. The exact algorithm for MAF is in this case 
much slower and this affects the running times even for 
the simple set. While the 4-approximation version has an 
average approximation ratio very close to 1 again, the run- 
ning time is in this case worse than that of TerminusEst. 
For the tricky set the situation is even more significant; 
the exact MAF algorithm cannot deal with reticulation 
numbers above 15, while TerminusEst can get slightly 
further. On the other hand, the 7-approximation still runs 
much faster than TerminusEst, both for simple and 
tricky instances, while having an average approximation 
ratio of less than 2.6. The practical worst-case approxima- 
tion ratio observed in these experiments was 1.5 for the 
4-approximation and 4 for the 7-approximation. 

It is worth noting that, for the 4-approximation, the run- 
ning time for the 75%-contraction trees is considerably 
lower than the one for the 50%-contraction trees. This 
is due to the fact that a high contraction in both trees 
causes the hybridization number of the instance to drop, 
and a lower hybridization number leads to a better run- 
ning time. Also note that the exact solver TerminusEst 
seems more able to cope with the tricky 25%-contraction 
instances than the tricky 50%-contraction instances. This 



is probably because, although low contraction rates yield a 
higher hybridization number, the trees remain "relatively 
binary" and this can induce more efficient branching in 
the underlying FPT algorithm [18]. It is plausible that with 
50%-contraction the instances suffer from the disadvan- 
tage of relatively high hybridization number without the 
branching advantages associated with (relatively) binary 
trees. 

To find the limits of the 7 - approx option of NBCK, we 
also tested it on huge, biologically meaningless, randomly 
generated trees. Below some results: 

• 1000 leaves, 25%-contraction, on average 995 
reticulations in 63 sec. 

• 1000 leaves, 50%-contraction, on average 989 
reticulations in 82 sec. 

• 1000 leaves, 75%-contraction, on average 840 
reticulations, in 656 sec. 

Computation times of this last run of experiments do not 
include the network construction. 

Practical experiments on biologically relevant trees 

Finally, we tested our methods on phylogenetic trees 
obtained from GreenPhylDB [30] - version 3, a database 
containing twenty-two full genomes of members of the 
plantae kingdom, ranging from algae to angiosperms. 
We were able to retrieve from the database the 9903 
rooted phylogenetic trees associated to the gene fami- 
lies contained in the database (the gene trees), along with 
the rooted phylogenetic tree describing the history of 
the twenty-two species contained in GreenPhylDB (the 
species tree). Note that the species tree for these species 
is not completely resolved, i.e. it is nonbinary. Among the 
gene trees, 2769 contain less than 3 species and they were 
discarded. Of the remaining 7134 trees, only 204 were 
directly usable for testing our methods. Indeed, because of 
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Table 4 Summary of results for dataset Fi (204 gene trees) 
originally obtained from GreenPhylDB database 





MIN 


AVG 


MAX 


Common taxa 


3 


5.235 


20 


opt 


0 


0.873 


7 


Ratio 4-approx 


1 


1.002 


1.2 


Ratio 6-approx 


1 


1.088 


3 


Gapa-EST-MAF) 


0 


0.010 


1 


Gap (4-approx - MAF) 


0 


0.020 


2 


TimeT-EST 


0 


0.221 


3 


Time 4-approx 


0 


0.270 


1 



Common taxa is the number of taxa after restricting the gene tree and the 
species tree to common taxa. opt is the exact hybridization number, as 
computed by TerminusEst. Ratio 4-approx (resp. 6-approx) is the ratio of the 
solution obtained by NonbinaryCycleKiller (running in 4-approx, resp. 
6-approx mode) to the solution obtained by TerminusEst. Gap (T-EST- MAF) is 
the absolute gap between the optimum MAF solution (here computed with 
rSPR) and the exact hybridization number, as computed by TerminusEst. Gap 
(4-approx - MAF) is the absolute gap between the optimum MAF solution and 
the reticulation number of the solution generated by NonbinaryCycleKiller 
running in its 4-approx mode. Time T-EST is the running time (in seconds) of 
TerminusEst, and Time 4-approx is the running time (in seconds) of 
NonbinaryCycleKiller running in its 4-approx mode. In 202 instances 
TerminusEst returned the same size solution as rSPR, in 202 cases TerminusEst 
returned the same size solution as NonbinaryCycleKiller (running in 4-approx 
mode), and in 201 cases NonbinaryCycleKiller (running in 4-approx mode) 
returned the same size solution as rSPR. 



gene duplication events arising in genomes, some species 
host several copies of the same gene, hence individual 
gene trees usually have several leaves labeled with iden- 
tical species names. Unfortunately, our methods do not 
handle such multi-labeled gene trees (MUL trees). We 
thus transformed the MUL trees into trees containing 
single copies of labels, applying the tools described in 
[31,32] to the forest F of 7134 trees. As in Section 4.1 



Table 5 Summary of results for dataset F2 (1 003 gene 
trees) originally obtained from GreenPhylDB database 





MIN 


AVG 


MAX 


Common taxa 


3 


1 1 .704 


22 


opt 


0 


2.854 


10 


Ratio 4-approx 


1 


1.025 


2 


Ratio 6-approx 


1 


1.264 


3 


Gap (T-EST- MAF) 


0 


0.048 


1 


Gap (4-approx - MAF) 


0 


0.165 


3 


TimeT-EST 


0 


0.576 


7 


Time 4-approx 


0 


0.605 


3 



In 955 instances TerminusEst returned tine same size solution as rSPR, in 91 1 
cases TerminusEst returned the same size solution as NonbinaryCycleKiller 
(running in 4-approx mode), and in 880 cases NonbinaryCycleKiller (running in 
4-approx mode) returned the same size solution as rSPR. 



Table 6 Summary of results for dataset F3 (5924 gene 
trees) originally obtained from GreenPhylDB database 





MIN 


AVG 


MAX 


Common taxa 


Z 


1 A THA 
1 ^.ZUD 


zz 


opt 


0 


3.613 


12 


Ratio 4-approx 


1 


1.027 


2 


Ratio 6-approx 


1 


1.277 


3 


Gap (T-EST- MAF) 


0 


0.065 


2 


Gap (4-approx - MAF) 


0 


0.195 


4 


TimeT-EST 


0 


0.689 


21 


Time 4-approx 


0 


0.729 


3 



In 5553 instances TerminusEst returned the same size solution as rSPR, in 5297 
cases TerminusEst returned the same size solution as NonbinaryCycleKiller 
(running in 4-approx mode), and in 5030 cases NonbinaryCycleKiller (running 
in 4-approx mode) returned the same size solution as rSPR. 



of [31], we obtained four data sets: Fi, F2, F^ and F3, 
respectively containing 204, 1003, 5924 and 5789 trees. 
Note that only F^ contains nonbinary trees. Finally, for 
each single labeled tree G e (Fi U F2 U Ff U F|), 
we restricted the species tree S (containing 22 taxa) to 
the leaves of G and we applied our methods to all so 
obtained pairs (restricted S, G). The results are presented 
in Tables 4, 5, 6 and 7. For all four datasets both Ter- 
minusEst and NonbinaryCycleKiller ran extremely 
quickly, rarely taking more than a couple of seconds for 
each species-gene tree pair. Moreover, the clear conclu- 
sion with this dataset is that, although the species-gene 
pairs are often incompatible, there are rarely many cycles 
to kill and optimum solutions to the hybridization number 
problem are generally extremely close to optimal solutions 
to MAF. 



Table 7 Summary of results for dataset F^ (5789 gene 
trees) originally obtained from GreenPhylDB database 





MIN 


AVG 


MAX 


Common taxa 


3 


17.319 


22 


opt 


0 


1.560 


12 


Ratio 4-approx 


1 


1.021 


2 


Ratio 7-approx 


1 


1.704 


4 


Gap (T-EST- MAF) 


0 


0.053 


4 


Gap (4-approx - MAF) 


0 


0.132 


5 


TimeT-EST 


0 


0.422 


15 


Time 4-approx 


0 


1.182 


14 



In 5552 instances TerminusEst returned the same size solution as rSPR, in 541 5 
cases TerminusEst returned the same size solution as NonbinaryCycleKiller 
(running in 4-approx mode), and in 5209 cases NonbinaryCycleKiller (running 
in 4-approx mode) returned the same size solution as MAF. In this dataset the 
gene trees were also nonbinary, meaning that NonbinaryCycleKiller had to 
use the MAF algorithm described in [21 ] instead of rSPR. 
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Conclusions 

Our experiments with binary trees show that 
CycleKiller is much faster than available exact meth- 
ods once the input trees become sufficiently large and/or 
discordant. In over 96% of the cases CycleKiller finds 
the optimal solution and in the remaining cases it finds 
a solution very close to the optimum. We have shown 
that the most accurate mode of the program produces 
solutions that are at most a factor 2 from the optimum. 
In practice, the average-case approximation ratio that we 
observed was 1.003. The fastest mode of the algorithm 
can be used on trees with thousands of leaves and prov- 
ably constructs networks that are at most a factor of 4 
from the optimum. 

Our experiments with nonbinary trees highlight once 
again that the cycle-breaking technique described in this 
article is intrinsically linked to the current state-of-the- 
art in MAF algorithms. TerminusEst is faster than the 
most accurate mode of NonbinaryCycleKiller when 
both trees are nonbinary due to the fact that MAF solvers 
for two nonbinary trees have not yet been optimized 
to the same extent as their binary counterparts. In fact, 
TerminusEst is the best avaible exact method for non- 
binary trees and can handle instances for which the 
optimum is up to 15-20. For other instances, NONBINA- 
ryCycleKiller in its fastest mode is much faster than 
TerminusEst and produces solutions that are at most a 
factor 4 from the optimum (less than 2.6 on average). 

Finally, for instances with one binary and one nonbi- 
nary tree, the most accurate mode of NonbinaryCy- 
CLEKiller is again much faster than TerminusEst and 
produces solutions that are at most a factor 1.75 from the 
optimum (less than 1.011 on average). 

Endnotes 

^In [15], it has been shown that this number can go up 
to 40 when running Dendroscope on a similar processor 
but allocating all cores for one instance, i.e. exploiting the 
possibilities of parallel computation of this 
implementation. 

^We note that NonbinaryCycleKiller uses a 
row-generation ILP formulation - based on [33] - to solve 
DFVS, rather than the polynomial-size formulation used 
by CycleKiller. ILP is in neither case a bottleneck for 
the running time. 
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